{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 Pandas\n", "When working with large complex datasets it is often useful to be able to manipulate the data in a variety of ways. Numpy is great for doing calculations on arrays but it is not so good at manipulating data. \n", "\n", "## 5.1 Series and Dataframes\n", "\n", "Pandas is a python package which builds on top of Numpy. It is designed to allow fast and efficient manipulation of data structures. It introduces two basic datatypes:\n", "\n", "1. A Series - a 1d array\n", "2. A Dataframe - a 2d array \n", "\n", "These datatypes are built on top of Numpy but the main difference is that all the data is not just indexed by sequential numbers but has labels. This will become clearer as we study them.\n", "\n", "### 5.1.1 Creating Series and DataFrames\n", "\n", "We can create both Series and DataFrames from other datatypes like lists, numpy arrays or dictionaries. See example below or the docs for full details (Series and DataFrames)\n", "\n", "*Be aware that when Pandas switched to v2 certain functions etc were deprecated (ie no longer available). There are lots of examples on the internet still using old methods which won't work. Check the docs if you are unsure*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "#Create a series\n", "my_data = [3,5,7,9,11]\n", "data_series = pd.Series(my_data)\n", "\n", "#A dataframe\n", "other_data = {'age':[20, 24, 24, 50],'name':['Jane','Tom','Lucy','Bob'], 'gender':['Female', 'Male','Female', 'Male']}\n", "df = pd.DataFrame(other_data) # You'll see that people often write the dataframe variable as df. \n", "\n", "print(data_series)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as the values we put in, there is an additional column to the left which is called the index. We didn't specify index labels, so pandas has created values which are just sequential numbers. However, part of the power of Pandas is we can specify a meaningful index when we create it or set index to a column later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.DataFrame(other_data, index=['a','c','e','g'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1.2 Selecting Data in a DataFrame\n", "\n", "This labelled index provides us with two ways of referring to the row. We can index by location (`df.iloc[row, col]`) in a manner similar to numpy. ie the first row is row 0, the second is row 1. The columns can also be indexed as 0 and 1. However, we can also index by label (`df.loc[row_label, col_label]`). It is possible that more than one row may have the same label and this second method will select both. \n", "\n", "*N.B. In Pandas wherever you want to select multiple values you need to add an extra pair of brackets.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Indexing by location')\n", "display(df.iloc[0]) # Get the first row\n", "display(df.iloc[:,0]) # Get the first column. N.B double brackets\n", "display(df.iloc[[1,2],[0,1]]) # Get the second and first row's values in first and second column. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1.2 Selecting Data in a DataFrame\n", "\n", "This labelled index provides us with two ways of referring to a row. We can index by location (`df.iloc[row, col]`) in a manner similar to Numpy, i.e. the first row is row 0, the second is row 1, and the columns are numbered in the same way. However, we can also index by label (`df.loc[row_label, col_label]`). More than one row may have the same label, in which case this second method will select all of them.\n", "\n", "*N.B. In Pandas, wherever you want to select multiple rows or columns you pass them as a list - hence the extra pair of brackets.*" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Indexing by location')\n", "display(df.iloc[0]) # Get the first row\n", "display(df.iloc[:,0]) # Get the first column\n", "display(df.iloc[[1,2],[0,1]]) # Get the second and third rows' values in the first and second columns. N.B. double brackets\n", "display(df.iloc[0,1]) # Get the item at the first row and second column\n", "\n", "print('\\nIndexing by label')\n", "display(df.loc['a']) # Get the row labelled a\n", "display(df.loc[:,'name']) # Get the column labelled name --> returns a Series\n", "display(df.loc[:,['name','age']]) # Get the columns labelled name and age --> returns a DataFrame\n", "display(df.loc['a','name']) # Value with row labelled a and column labelled name" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1.3 Modifying DataFrames\n", "\n", "We can pull out parts of a DataFrame or Series as new objects, and modify or extend the existing ones." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "age_series = df['age'] # Select a single column as a Series\n", "new_df = df[['age','name']] # Again, notice the extra brackets\n", "\n", "df['subject'] = ['maths','biology','computer science','physics'] # Add a new column. Use a list or np.array, but the length must match the original df\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also apply conditional filtering to the rows, like this:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names_with_o = df[df['name'].str.contains('o')] # A Series has a .str property which allows you to operate on all the strings in a column\n", "display(names_with_o)\n", "\n", "young_women = df[(df['age'] < 25) & (df['gender'] == 'Female')] # We can use any logic to filter. Find women under 25.\n", "display(young_women)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Since each column is effectively a Numpy array, we can also do mathematical and statistical operations on it." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(df['age']*3 + df['age'])\n", "df['age'].mean()" ] },
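{ "cell_type": "markdown", "metadata": {}, "source": [ "Because these operations return a Series, a common pattern is to store the result back into the DataFrame as a new column. A minimal sketch (the column name `age_in_months` is purely illustrative):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Store the result of a vectorised calculation as a new column\n", "df['age_in_months'] = df['age'] * 12\n", "df" ] },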
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Data input and output\n", "### 5.2.1 Reading Data from Files\n", "\n", "Whilst we can generate Series and DataFrames like this, it is more common to read a large dataset from a file or database. Pandas accepts many formats. For now we will work with csv files, but you could also use Excel, HDF5 (a hierarchical binary format), SQL databases etc., for which there are similar methods. As an example we are going to work with the public Covid-19 dataset maintained by Our World in Data. The file is included in `resources/textfiles/owid-covid-data.csv`" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_covid = pd.read_csv('resources/textfiles/owid-covid-data.csv')\n", "df_covid\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Pandas displays the beginning and the end of the DataFrame, but all the data has been loaded into memory for us to work with.\n", "\n", "There are a whole host of options when reading a file, so it is always worth looking at the docs. Not all the columns above are of interest, so we can pick out just the ones we want, and we might also set the index to a more meaningful column. The method `df.head(n=number_of_rows)` lets us look at just the first few rows of the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_covid = pd.read_csv('resources/textfiles/owid-covid-data.csv', index_col='date',usecols=['location','continent', 'date', 'total_cases','new_cases','total_deaths','total_cases_per_million','hospital_beds_per_thousand','life_expectancy','population'])\n", "df_covid_1stMay = df_covid.loc['2020-05-01'] # The dataset contains an entry for every day, so let's look at one specific date\n", "df_covid_1stMay.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2.2 Writing Data to files\n", "\n", "We can also write DataFrames to file very easily with `df.to_csv()` (see the docs). Similar methods exist to write to other formats." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "time = np.linspace(0,4*np.pi,100)\n", "df_example = pd.DataFrame({'time':time,'amplitude':np.sin(0.3*time)})\n", "df_example.to_csv('resources/textfiles/write_example.csv', index=False) # index=False omits the index column from the file" ] },
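{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick round-trip check (a sketch which assumes the cell above has been run, so the file exists), we can read the file straight back in:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read back the file we just wrote to confirm the round trip\n", "pd.read_csv('resources/textfiles/write_example.csv').head()" ] },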
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3.1 Returning vs inplace\n", "\n", "Many methods have a keyword argument called `inplace`. By default this is usually `False`, meaning that the method will return a new DataFrame, leaving the original unchanged, i.e.:\n", "\n", "    new_df = pd.pandas_method(old_df)\n", "\n", "However, if you set `inplace=True` this modifies the original DataFrame, so there is no return value. This can be really good for memory and speed, but it comes with the trade-off that you've changed the original data, which you might want for something else.\n", "\n", "    pd.pandas_method(df, inplace=True)\n", "\n", "If you get odd things happening in your code, where you seem to have lost your DataFrame and Python is complaining about `NoneType`, it may be that you've written something like the line below (with `inplace=True` the method returns `None`, so `new_df` ends up as `None`):\n", "\n", "    new_df = pd.pandas_method(df, inplace=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3.2 Data Cleaning\n", "\n", "In any real-world dataset the data is usually not perfect and may not be in the format we want. There may be column data we are not interested in; for this example we'll pick `new_cases`. We can also see that some of the values are not present and have been filled with NaN. This is the same as the `np.nan` we saw before. There are several ways to deal with NaNs, but the decision is yours and depends on your data and task:\n", "\n", "1. Leave them:\n", "\n", "    NaNs can be helpful because they are not included in calculations of the mean or median of a column and hence don't skew the results by including 0 or some other value.\n", "\n", "2. Replace them with another value:\n", "\n", "    You can replace the NaN values with a substitute value, perhaps 0 or something else appropriate.\n", "\n", "3. Remove the row or column where the NaNs appear." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# N.B. Running this cell more than once will error, because the data has already been changed. To rerun it, first rerun the cell above\n", "\n", "# Drop the column \"new_cases\"\n", "df_covid_1stMay.drop(columns='new_cases', inplace=True) # Note there is no return value; the original df is modified because inplace=True\n", "\n", "# Replace the NaNs in hospital_beds_per_thousand with 1\n", "df_covid_1stMay.loc[:,'hospital_beds_per_thousand'].fillna(value=1,inplace=True) # This chained selection triggers a warning - see 5.3.3\n", "\n", "# Drop the rows where either total_cases or life_expectancy contains a NaN\n", "df_covid_1stMay.dropna(inplace=True, subset=['total_cases','life_expectancy'], how='any')\n", "\n", "df_covid_1stMay.head(n=5)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3.3 Copy vs Views\n", "\n", "Pandas did what we asked, but there were some warnings. What are those about?\n", "\n", "When we looked at the Python datatype `list`, we saw that if we set one list equal to another, changes to one affected the other. In Pandas there is a slightly complicated subtlety. Sometimes Pandas returns a *copy* (i.e. an independent set of data) and sometimes a *view*. A *view* is similar to the effect we saw with lists, in that it accesses the same data and hence changes to the *view* will also affect the original. See the example below. It's not always obvious when Pandas does which, but it's something to be aware of if you get odd things happening. See the discussion of views versus copies in the Pandas docs." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "other_data = {'age':[20, 24, 24, 50],'name':['Jane','Tom','Lucy','Bob'], 'gender':['Female', 'Male','Female', 'Male']}\n", "df = pd.DataFrame(other_data, index=['a','c','e','g'])\n", "\n", "view = df['name']\n", "view['g'] = 'Martha' # Change a value on the Series view; it alters the original DataFrame too\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5.4 Aggregation and grouping\n", "\n", "Often we want to group our data into categories and work out the properties of each group. For example, returning to the Covid dataset, let's group the countries together by continent. We will then work out, for each continent, the max, min and average life expectancy. See the Pandas `groupby` docs for full details." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a groupby object\n", "continent_groupby = df_covid.groupby('continent')\n", "print(continent_groupby.groups.keys()) # Display the categories\n", "\n", "# For each group of countries calculate the max, min and mean life expectancy\n", "display(continent_groupby['life_expectancy'].max())\n", "display(continent_groupby['life_expectancy'].min())\n", "display(continent_groupby['life_expectancy'].mean())" ] },
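{ "cell_type": "markdown", "metadata": {}, "source": [ "Rather than calling each statistic separately, `agg` computes several at once and returns them as a table - a minimal sketch:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute several statistics per group in one call\n", "continent_groupby['life_expectancy'].agg(['max', 'min', 'mean'])" ] },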
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5 TimeSeries Data\n", "\n", "Another common class of tasks is operating on data that forms an ordered sequence, often data that varies over time. For example, we might want to calculate the running average of some noisy data as a way of smoothing it. Our original Covid dataset contains the number of new cases every day, for each country. We might expect that the fluctuations from day to day would make this data less useful than a weekly figure. Let's try to calculate the number of new cases in the UK summed over each week. To do that we will take a 7-day rolling sum. See the Pandas `rolling` docs for full details." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Extract data for the UK\n", "df_uk = df_covid[df_covid['location'] == 'United Kingdom']\n", "df_weekly=df_uk['new_cases'].rolling(window=7, center=True).sum()\n", "display(df_weekly.loc['2023-05-01':'2023-05-10']) # Note we can slice on the row labels because the dates are sequential" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can then see the effect of the smoothing\n", "fig, ax = plt.subplots()\n", "\n", "ax.plot(7*df_uk['new_cases'],'r-') # Original daily data, scaled by 7 to compare with the weekly sum\n", "ax.plot(df_weekly,'b-') # A Series can be passed directly to matplotlib; it uses the index as the x values\n", "\n", "xticks = df_weekly.index[::300] # A bit of slicing to generate a smaller number of labels for the x axis\n", "ax.set_xticks(xticks)\n", "ax.set_xlabel('Date')\n", "ax.set_ylabel('Total New Cases')\n", "ax.set_title('New Covid Cases UK')" ] },
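{ "cell_type": "markdown", "metadata": {}, "source": [ "We mentioned smoothing with a running average above; the same idea with `.mean()` instead of `.sum()` gives a 7-day rolling average (a minimal sketch - the weekly sum is just this multiplied by 7):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A 7-day rolling average of the same data\n", "df_smooth = df_uk['new_cases'].rolling(window=7, center=True).mean()\n", "df_smooth.loc['2023-05-01':'2023-05-10']" ] },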
\"Drawing\"
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A simplified example of the above.\n", "df1 = pd.DataFrame({'A':[1,2,3],'B':[2,3,4],'C':[3,4,5]}, index=[1,2,3])\n", "df2 = pd.DataFrame({'A':[4,5,6],'B':[5,6,7],'C':[6,7,8]}, index=[2,3,4])\n", "\n", "display(pd.concat([df1,df2])) # Sometimes the values in the index will clash. Sometimes you want this - The index info is important.\n", "display(pd.concat([df1,df2], ignore_index=True)) # Sometimes you don't - You want each value to have its own reference.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.6.2 Merging\n", "\n", "There are many different options for joining DataFrames. A merge is often used to join DataFrames horizontally. This can be really helpful if you have two datasources which contain related information. By merging them we can easily work out how the data is related.\n", "\n", "If DataFrames have a common index or column, you can use this to join them. They don't have to have the same number of rows. Pandas will use the rows that are the same and produce NaNs in positions where values don't exist. You can also join on multiple columns. Here things get a bit messy. You can use the values in the joining column or columns retaining the rows in either the left dataframe or the right dataframe, the common values ('inner') or all values ('outer').\n", "\n", "Perhaps easiest to see this as an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A simplified example of the above.\n", "df1 = pd.DataFrame({'Name':['Bob','Mike','Jane'],'Subject':['Physics','Maths','Chemistry']}, index=[1,2,3])\n", "df2 = pd.DataFrame({'Name':['Mike','Jane','Luke'],'Year':[1,2,3]}, index=[2,3,4])\n", "display(df1)\n", "display(df2)\n", "\n", "print('merge results')\n", "\n", "display(pd.merge(df1,df2,on='Name', how='left')) # Only keep the rows where the Name exists in df1\n", "display(pd.merge(df1,df2,on='Name', how='right')) # Only keep the rows where the Name exists in df2\n", "display(pd.merge(df1,df2,on='Name', how='inner')) # Only keep the rows where the Name exists in both df1 and df2\n", "display(pd.merge(df1,df2,on='Name',how='outer')) # keep all the rows\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.7 Speeding things up in Pandas\n", "\n", "There are several simple things that you can do to make your Pandas code run faster:\n", "\n", "1. Just as with Numpy, you should avoid writing code that performs an operation on a DataFrame by running a for loop on each row sequentially. The in-built Pandas functions will be much faster than using python code. \n", "\n", "2. Where possible use the `inplace=True` as a keyword argument to functions. This modifies the original data but generally runs faster.\n", "\n", "3. Use the index to access rows rather than boolean selection (see example below).\n", "\n", "4. The order of the index can make a huge difference. Sorting it into order and then using the index to access data can result in huge speed ups. The example below shows that once sorted there is a speed up from about 25ms to 200μs!" 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5.7 Speeding things up in Pandas\n", "\n", "There are several simple things that you can do to make your Pandas code run faster:\n", "\n", "1. Just as with Numpy, you should avoid writing code that performs an operation on a DataFrame by running a for loop over each row sequentially. The in-built Pandas operations will be much faster than a Python loop.\n", "\n", "2. Where possible, use `inplace=True` as a keyword argument to methods. This modifies the original data but generally runs faster.\n", "\n", "3. Use the index to access rows rather than boolean selection (see the example below).\n", "\n", "4. The order of the index can make a huge difference. Sorting it into order and then using the index to access data can result in huge speed-ups. The example below shows that once sorted there is a speed-up from about 25ms to 200μs!" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Working with the df_uk frame\n", "df_uk = df_covid[df_covid['location'] == 'United Kingdom']\n", "df_uk.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "# Using .loc is (a bit) quicker than the boolean indexing in the next cell\n", "df_uk.loc['2022-06-01']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "# Boolean indexing builds a mask over every row first\n", "df_uk[df_uk.index=='2022-06-01']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "# However, there is a big difference if we sort the index. This cell does the same lookup on the full df_covid dataset,\n", "# whose rows are ordered by location and then by date, so the date index is not sorted - and the lookup is pretty slow.\n", "df_covid['location'].loc['2021-05-03']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Sorting the dataframe takes a while, but if we want to run operations on it many times it is better to sort once and reuse it.\n", "# (%%time rather than %%timeit here: variables assigned inside a %%timeit cell are not kept, so df_covid2 would not exist for the next cell.)\n", "df_covid2 = df_covid.copy()\n", "df_covid2.sort_index(inplace=True)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "# Now the index is sorted, the .loc operation is really fast\n", "df_covid2['location'].loc['2021-05-03']" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }